[57] ABSTRACT

A computer-based system and method of retrieving infor- 
mation pertaining to Web documents on a computer network 
is disclosed. The method includes maintaining an address 
map that associates primary addresses with secondary 
addresses. A primary address includes a network retrieval 
protocol and a network address. The secondary address may 
include a different retrieval protocol or a different network 
address from the primary document address. A Web crawler 
retrieves a Web document using the primary document 
address, and determines whether the address map contains a 
secondary document address prefix corresponding to the 
primary document address prefix. If a secondary document 
address prefix exists, the Web crawler creates a secondary 
address, retrieves additional information pertaining to the 
Web document, and combines the additional information 
with the data retrieved from the Web document. The com- 
bined data may be stored in an index, and subsequently used 
to perform a document search. 

28 Claims, 5 Drawing Sheets 

METHOD OF WEB CRAWLING UTILIZING 
ADDRESS MAPPING 

FIELD OF THE INVENTION 

The present invention relates to the field of network 
information software and, in particular, to methods and 
systems for retrieving data from network sites. 

BACKGROUND OF THE INVENTION 

In recent years, there has been a tremendous proliferation 
of computers connected to a global network known as the 
Internet. A "client" computer connected to the Internet can
download digital information from "server" computers
connected to the Internet. Client application software executing
on client computers typically accepts commands from a user
and obtains data and services by sending requests to server
applications running on server computers connected to the
Internet. A number of protocols are used to exchange
commands and data between computers connected to the
Internet. The protocols include the File Transfer Protocol (FTP),
the Hyper Text Transfer Protocol (HTTP), the Simple Mail 
Transfer Protocol (SMTP), and the "Gopher" document 
protocol. 

The HTTP protocol is used to access data on the World
Wide Web, often referred to as "the Web." The World Wide
Web is an information service on the Internet providing
documents and links between documents. The World Wide
Web is made up of numerous Web sites around the world
that maintain and distribute Web documents. A Web site may
use one or more Web server computers that store and
distribute documents in one of a number of formats,
including the Hyper Text Markup Language (HTML). An HTML
document contains text and metadata or commands
providing formatting information. HTML documents also include
embedded "links" that reference other data or documents
located on any Web server computer. The referenced
documents may represent text, graphics, audio, or video in
respective formats.

A Web browser is a client application that communicates
with server computers via FTP, HTTP, and Gopher
protocols. Web browsers receive Web documents from the
network and present them to a user. Internet Explorer, available
from Microsoft Corporation, of Redmond, Wash., is an
example of a popular Web browser application.

An intranet is a local area network containing Web servers 
and client computers operating in a manner similar to the 
World Wide Web described above. Typically, all of the 
computers on an intranet are contained within a company or 
organization. 

Web crawlers are computer programs that automatically 
retrieve numerous Web documents from one or more Web 
sites. A Web crawler processes the received data, preparing 
the data to be subsequently processed by other programs. 
For example, a Web crawler may use the retrieved data to 
create an index of documents available over the Internet or 
an intranet. A "search engine" can later use the index to 
locate Web documents that satisfy specified criteria.

Web crawlers use the same protocols as other programs, 
such as Web browsers and file system explorers, to access 
Web documents. The type of data that a Web crawler 
retrieves is determined by the protocol used. For example, 
the HTTP protocol does not provide a mechanism to obtain 
an access control list corresponding to a Web document. In 
another example, a Web document may have an associated 
second Web document at a different address, the second Web 
document containing information pertaining to the first Web
document. HTTP does not provide an easy mechanism for 
obtaining related data from multiple sources and combining 
the data. 

It is desirable to have a mechanism by which a Web 
crawler can increase the amount of information it obtains for 
each Web document. Preferably, such a mechanism will 
provide a Web crawler with a way to obtain information 
pertaining to a Web document by using more than one 
protocol. Additionally, a preferable mechanism will also 
provide a Web crawler with a way to obtain information 
pertaining to a Web document from a source other than the 
Web document itself. The present invention is directed to 
providing such a mechanism. 

SUMMARY OF THE INVENTION 
In accordance with this invention, a system and computer- 
based method of retrieving data from a computer network 
are provided. The method includes performing a Web crawl, 
by retrieving a Web document and subsequently retrieving 
additional Web documents based on addresses specified in 
hyperlinks within each Web document. For each Web 
document, an address map is checked to determine whether 
the document has a secondary document address corre- 
sponding to the first, or primary, document address. If a 
secondary document address exists, the secondary document 
address is used to retrieve data pertaining to the Web 
document. 
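
The crawl described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `crawl`, `secondary_address`, and `fetch` are hypothetical, the address map is modeled as a plain dictionary of prefix pairs, and `fetch` stands in for whatever retrieval mechanism is actually used.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signature: a fetcher takes an address and returns
# (data, hyperlinks). It stands in for real network retrieval.
Fetcher = Callable[[str], Tuple[str, List[str]]]


def secondary_address(address_map: Dict[str, str], url: str) -> Optional[str]:
    """Return the secondary address for url, or None if no entry matches."""
    for primary_prefix, secondary_prefix in address_map.items():
        if url.startswith(primary_prefix):
            return secondary_prefix + url[len(primary_prefix):]
    return None


def crawl(seeds: List[str], address_map: Dict[str, str],
          fetch: Fetcher) -> Dict[str, dict]:
    """Visit each reachable document once, adding secondary data when mapped."""
    pending = deque(seeds)
    seen = set(seeds)
    results: Dict[str, dict] = {}
    while pending:
        url = pending.popleft()
        data, links = fetch(url)           # retrieve via the primary address
        record = {"data": data, "extra": None}
        mapped = secondary_address(address_map, url)
        if mapped is not None:             # a secondary address exists
            record["extra"], _ = fetch(mapped)
        results[url] = record
        for link in links:                 # follow hyperlinks, skipping repeats
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return results
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular protocol, so the same loop can be exercised with an in-memory table of documents.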

In accordance with other aspects of this invention, a 
document address includes a protocol specification and a 
network address specification. The secondary document 
address may differ from the primary document address by 
having different specified protocols, different network 
addresses, or both. The secondary document address allows 
the retrieval of data not easily obtained using the first 
document address. The additional data may include data 
obtainable by using file system commands.
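
As a small illustration of how two document addresses can differ, the following Python sketch splits each address into its protocol and network address parts and compares them. The function name `describe_difference` and the example addresses are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def describe_difference(primary: str, secondary: str) -> dict:
    """Report which parts differ between a primary and a secondary address."""
    p, s = urlsplit(primary), urlsplit(secondary)
    return {
        "protocol_differs": p.scheme != s.scheme,
        "network_address_differs": p.netloc != s.netloc,
    }


# Hypothetical pair: the same document reached over HTTP and as a file.
example = describe_difference(
    "http://www.example.com/docs/page1.html",
    "file://fileserver/share/docs/page1.html",
)
```

In this hypothetical pair both the protocol and the network address differ, which corresponds to the "or both" case described above.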

In accordance with still other aspects of this invention, 
after retrieving a Web document using a primary document 
address and additional data pertaining to the Web document 
using a secondary document address, the data obtained using 
the secondary document address is stored with the data 
obtained from the Web document. The combined data may 
be stored in a document index, which is subsequently used 
to locate the Web document. In accordance with yet still 
other aspects of this invention, an address map is main- 
tained. The address map preferably includes a set of entries, 
each entry having a portion of a primary Web address and a 
corresponding portion of a secondary Web address. 
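
One possible shape for such an address map, and for the combined record stored in the index, is sketched below in Python. The names `AddressMapping`, `find_mapping`, and `index_record` are hypothetical, and preferring the longest matching primary prefix when several entries match is an assumption of this sketch rather than a requirement stated in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AddressMapping:
    """One address map entry: a primary prefix and its secondary prefix."""
    primary_prefix: str
    secondary_prefix: str


def find_mapping(entries: List[AddressMapping],
                 url: str) -> Optional[AddressMapping]:
    """Pick the entry whose primary prefix is the longest match for url."""
    matches = [e for e in entries if url.startswith(e.primary_prefix)]
    return max(matches, key=lambda e: len(e.primary_prefix), default=None)


def index_record(url: str, primary_data: Dict[str, str],
                 secondary_data: Dict[str, str]) -> dict:
    """Store both data sets together under the primary address."""
    return {"address": url, "primary": primary_data, "secondary": secondary_data}
```

Keeping the two data sets under separate keys avoids deciding which source wins when both supply the same property; a real index could merge them differently.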

As will be readily appreciated from the foregoing 
description, a system and method for retrieving data from 
Web documents on a computer network provide a way of 
retrieving and storing information pertaining to a Web 
document, wherein the information is not easily obtainable 
using a single retrieval protocol and network address. The 
invention allows a Web crawler to retrieve file system 
information, such as an access list, corresponding to a Web 
document, wherein the Web document is originally retrieved 
using a protocol that does not provide the file system 
information. The invention also allows data from two dis- 
tinct Web documents to be combined, wherein a primary 
Web document has a corresponding secondary Web docu- 
ment containing information pertaining to the primary Web 
document. 

BRIEF DESCRIPTION OF THE DRAWINGS 
The foregoing aspects and many of the attendant advan- 
tages of this invention will become more readily appreciated 
as the same becomes better understood by reference to the
following detailed description, when taken in conjunction
with the accompanying drawings, wherein:

FIG. 1 is a block diagram of a general purpose computer
system for implementing the present invention;

FIG. 2 is a block diagram illustrating a network
architecture, in accordance with the present invention;

FIG. 3 is a block diagram illustrating an architecture of a
Web crawler program, in accordance with the present
invention;

FIG. 4 illustrates a data structure used to map addresses,
in accordance with the present invention; and

FIG. 5 is a flow diagram illustrating the process of
retrieving information pertaining to a Web document.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

The present invention is a mechanism for obtaining
information pertaining to Web documents that reside on one
or more server computers. A server computer is referred to
as a Web site, and the process of locating and retrieving
digital data from Web sites is referred to as "Web crawling."
The mechanism of the invention uses a table to associate
Web address prefixes with a corresponding prefix that, if
substituted in the original address, may yield another
address with a different protocol, network site, or path. The
crawler uses the corresponding addresses or protocols to
obtain information that supplements the data received by
directly accessing the document using the document's
primary address.

In accordance with the present invention, a Web crawler
program executes on a computer, preferably a general
purpose personal computer. FIG. 1 and the following discussion
are intended to provide a brief, general description of a
suitable computing environment in which the invention may
be implemented. Although not required, the invention will
be described in the general context of computer-executable
instructions, such as program modules, being executed by a
personal computer. Generally, program modules include
routines, programs, objects, components, data structures,
etc. that perform particular tasks or implement particular
abstract data types. Moreover, those skilled in the art will
appreciate that the invention may be practiced with other
computer system configurations, including hand-held
devices, multiprocessor systems, microprocessor-based or
programmable consumer electronics, network PCs,
minicomputers, mainframe computers, and the like. The
invention may also be practiced in distributed computing
environments where tasks are performed by remote
processing devices that are linked through a communications
network. In a distributed computing environment, program
modules may be located in both local and remote memory
storage devices.

With reference to FIG. 1, an exemplary system for
implementing the invention includes a general purpose computing
device in the form of a conventional personal computer 20,
including a processing unit 21, a system memory 22, and a
system bus 23 that couples various system components
including the system memory to the processing unit 21. The
system bus 23 may be any of several types of bus structures
including a memory bus or memory controller, a peripheral
bus, and a local bus using any of a variety of bus
architectures. The system memory includes read only memory
(ROM) 24 and random access memory (RAM) 25. A basic
input/output system 26 (BIOS), containing the basic routines
that help to transfer information between elements within
the personal computer 20, such as during startup, is stored in
ROM 24. The personal computer 20 further includes a hard
disk drive 27 for reading from and writing to a hard disk, not
shown, a magnetic disk drive 28 for reading from or writing
to a removable magnetic disk 29, and an optical disk drive
30 for reading from or writing to a removable optical disk 31
such as a CD ROM or other optical media. The hard disk
drive 27, magnetic disk drive 28, and optical disk drive 30
are connected to the system bus 23 by a hard disk drive
interface 32, a magnetic disk drive interface 33, and an
optical drive interface 34, respectively. The drives and their
associated computer-readable media provide nonvolatile
storage of computer readable instructions, data structures,
program modules and other data for the personal computer
20. Although the exemplary environment described herein
employs a hard disk, a removable magnetic disk 29 and a
removable optical disk 31, it should be appreciated by those
skilled in the art that other types of computer-readable media
which can store data that is accessible by a computer, such
as magnetic cassettes, flash memory cards, digital versatile
disks, Bernoulli cartridges, random access memories
(RAMs), read only memories (ROM), and the like, may also
be used in the exemplary operating environment.

A number of program modules may be stored on the hard
disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25,
including an operating system 35, one or more application
programs 36, other program modules 37, and program data
38. A user may enter commands and information into the
personal computer 20 through input devices such as a
keyboard 40 and pointing device 42. Other input devices
(not shown) may include a microphone, joystick, game pad,
satellite dish, scanner, or the like. These and other input
devices are often connected to the processing unit 21
through a serial port interface 46 that is coupled to the
system bus, but may be connected by other interfaces, such
as a parallel port, game port or a universal serial bus (USB).
A monitor 47 or other type of display device is also
connected to the system bus 23 via an interface, such as a
video adapter 48. One or more speakers 57 are also
connected to the system bus 23 via an interface, such as an audio
adapter 56. In addition to the monitor and speakers, personal
computers typically include other peripheral output devices
(not shown), such as printers.

The personal computer 20 operates in a networked
environment using logical connections to one or more remote
computers, such as remote computers 49 and 60. Each
remote computer 49 or 60 may be another personal
computer, a server, a router, a network PC, a peer device or
other common network node, and typically includes many or
all of the elements described above relative to the personal
computer 20, although only a memory storage device 50 or
61 has been illustrated in FIG. 1. The logical connections
depicted in FIG. 1 include a local area network (LAN) 51
and a wide area network (WAN) 52. Such networking
environments are commonplace in offices, enterprise-wide
computer networks, intranets and the Internet. As depicted in
FIG. 1, the remote computer 60 communicates with the
personal computer 20 via the local area network 51. The
remote computer 49 communicates with the personal
computer 20 via the wide area network 52.

When used in a LAN networking environment, the
personal computer 20 is connected to the local network 51
through a network interface or adapter 53. When used in a
WAN networking environment, the personal computer 20
typically includes a modem 54 or other means for
establishing communications over the wide area network 52, such as
the Internet. The modem 54, which may be internal or
external, is connected to the system bus 23 via the serial port
interface 46. In a networked environment, program modules
depicted relative to the personal computer 20, or portions
thereof, may be stored in the remote memory storage device.
It will be appreciated that the network connections shown
are exemplary and other means of establishing a
communications link between the computers may be used.

FIG. 2 illustrates an architecture of a networked system in
which the present invention operates. A server computer 204
includes a Web crawler program 206 executing thereon. The
Web crawler program 206 searches for Web documents
distributed on one or more computers connected to a
computer network 216, such as the remote server computer 218
depicted in FIG. 2. The computer network 216 may be a
local area network 51 (FIG. 1), a wide area network 52, or
a combination of networks that allow the server computer
204 to communicate with remote computers, such as the
remote server computer 218, either directly or indirectly.
The server computer 204 and the remote server computer
218 are preferably similar to the personal computer 20
depicted in FIG. 1 and discussed above.

The Web crawler program 206 searches remote server
computers 218 connected to the network 216 for Web
documents 222 and 224. The Web crawler 206 retrieves Web
documents and associated data. The contents of the Web
documents 222 and 224, along with the associated data, can
be used in a variety of ways. For example, the Web crawler
206 may pass the information to an indexing engine 208. An
indexing engine 208 is a computer program that maintains
an index 210 of Web documents. The index 210 is similar to
the index in a book, and contains reference information and
pointers to corresponding Web documents to which the
reference information applies. For example, the index may
include keywords, and for each keyword a list of addresses.
Each address can be used to locate a document that includes
the keyword. The index may also include information other
than keywords used within the Web documents. For
example, the index 210 may include subject headings or
category names, even when the literal subject heading or
category name is not included within the Web document.
The type of information stored in the index depends upon the
complexity of the search engine, which may analyze the
contents of the Web document and store the results of the
analysis.

A client computer 214, such as the personal computer 20
(FIG. 1), is connected to the server computer 204 by a
computer network 212. The computer network 212 may be
a local area network, a wide area network, or a combination
of networks. The computer network 212 may be the same
network as the computer network 216 or a different network.
The client computer 214 includes a computer program, such
as a "browser" 215 that locates and displays documents to a
user. When a user at the client computer 214 desires to
search for one or more Web documents, the client computer
transmits data to a search engine 230 requesting a search. At
that time the search engine 230 examines its associated
index 210 to find documents that may be desired by the user.
The search engine 230 may then return a list of documents
to the browser 215 at the client computer 214. The user may
then examine the list of documents and retrieve one or more
desired Web documents from remote computers such as the
remote server computer 218.

The Web crawler program 206 maintains an address map
226. The address map 226 is a simple database that contains
a list of Web document address prefixes. For each Web
document address prefix, referred to as a "primary" address
prefix, the address map 226 contains a "secondary" address
prefix corresponding to the primary address prefix. The Web
crawler 206 uses the secondary address prefix to build a
secondary address. The secondary address is used to retrieve
information pertaining to a document, in order to augment or
replace the information retrieved by using the primary
address. This process is described in further detail below.

As will be readily understood by those skilled in the art of
computer network systems, and others, the system illustrated
in FIG. 2 is exemplary, and alternative configurations may
also be used in accordance with the invention. For example,
the server computer 204 itself may include Web documents
232 and 234 that are accessed by the Web crawler program
206. Also the Web crawler program 206, the indexing engine
208, and the search engine 230 may reside on different
computers. Additionally, the Web browser program and the
Web crawler program 206 may reside on a single computer.
Further, the indexing engine 208 and search engine 230 are
not required by the present invention. The Web crawler
program 206 may retrieve Web document information for
usages other than providing the information to a search
engine. As discussed above, the client computer 214, the
server computer 204, and the remote server computer 218
may communicate through any type of communication
network or communications medium.

FIG. 3 illustrates, in further detail, a Web crawler program
206 and related software executing on the server computer
204 (FIG. 2) that performs Web crawling and indexing of
information in accordance with the present invention. As
illustrated in FIG. 3, the Web crawler program 206 includes
a "gatherer" process 304 that performs crawling of the Web
and gathering of information pertaining to Web documents.
The gatherer process 304 is invoked by passing it one or
more starting URLs 306. The starting URLs 306 serve as
seeds, instructing the gatherer process 304 where to begin its
Web crawling process. A starting URL can be a universal
naming convention (UNC) directory, a UNC path to a file,
or an HTTP path to a file. A URL, or Web document address,
comprises specifications of a protocol, a domain, and a path
within the domain. The domain is also referred to as the host.
In one actual embodiment of the invention, the protocol and
domain specifications form an address prefix. As will be
understood by those skilled in the art of computer
programming, and others, the invention can be used with
different address schemes.

The gatherer process 304 inserts the starting URLs 306
into a transaction log 310, which maintains a list of URLs
that are currently being processed or have not yet been
processed. The transaction log 310 functions as a queue. It
is called a log because it is preferably implemented as a
persistent queue that is written and kept on a disk to enable
recovery after a system failure. Preferably, the transaction
queue maintains a small in-memory cache for quick access
to the next transactions.

The gatherer process 304 also maintains a history map
308, which contains an ongoing list of all URLs that have
been searched during the current Web crawl. The gatherer
process 304 includes one or more worker threads 312 that
process each URL. The worker thread 312 retrieves a URL
from the transaction log 310 and passes the URL to a filter
daemon 314. The filter daemon 314 is a process that uses the
URL to retrieve the Web document at the address specified
by the URL. The filter daemon 314 uses the access method
specified by the URL to retrieve the Web document. For
example, if the access method is HTTP, the filter daemon
314 uses HTTP commands to retrieve the document. If the
access method specified is FILE, the filter daemon uses file
system commands to retrieve the corresponding documents.
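
Selecting a retrieval routine from the access method named in the URL can be sketched as a small dispatch table. This is a minimal Python illustration, not the filter daemon itself: the names `retrieve`, `retrieve_http`, `retrieve_file`, and `RETRIEVERS` are hypothetical, and the file handler assumes a local file URL and ignores any host component.

```python
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import url2pathname, urlopen


def retrieve_http(url: str) -> bytes:
    """Fetch a document with HTTP commands."""
    with urlopen(url) as response:
        return response.read()


def retrieve_file(url: str) -> bytes:
    """Fetch a document with file system commands (local files only)."""
    return Path(url2pathname(urlsplit(url).path)).read_bytes()


# Access method (URL scheme) -> retrieval routine.
RETRIEVERS = {"http": retrieve_http, "file": retrieve_file}


def retrieve(url: str) -> bytes:
    """Dispatch on the access method specified by the URL."""
    scheme = urlsplit(url).scheme.lower()
    try:
        handler = RETRIEVERS[scheme]
    except KeyError:
        raise ValueError(f"no retriever registered for {scheme!r}") from None
    return handler(url)
```

Further access methods could be supported by registering another handler in the table, without changing `retrieve`.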
The File Transfer Protocol (FTP) is another well
known access method that the filter daemon may use to
retrieve a document. Other access protocols, such as
database retrieval specifications, may also be used in
conjunction with the invention.

After retrieving a Web document, the filter daemon parses
the Web document and returns a list of text and properties.
An HTML document includes a sequence of "tags," each tag
containing some information. The information may be text
that is to be displayed in the Web browser program 215
(FIG. 2). The information may also be "metadata" that
describes the formatting of text. The information within tags
may also contain hyperlinks to other Web documents. A
hyperlink includes a specification of a Web address. If the
tag containing a hyperlink is an image, the Web browser
program 215 uses the hyperlink to retrieve the image and
render it on the Web page. Similarly, the hyperlink may
specify the address of audio data. If a hyperlink points to
audio data, the Web browser program retrieves the audio
data and plays it.

An "anchor" tag specifies a visual element and a
hyperlink. The visual element may be text or a hyperlink to an
image. When a user selects an anchor having an associated
hyperlink in a Web browser program 215, the Web browser
program automatically retrieves a Web document at the
address specified in the hyperlink.

Tags may also contain information intended for a search
engine. For example, a tag may include a subject or category
within which the Web document falls, to assist search
engines that perform searches by subject or category. The
information contained in tags is referred to as "properties" of
the Web document. A Web document is therefore considered
to be made up of a set of properties and text. The filter
daemon 314 returns the list of properties and text within a
Web document to the worker thread 312.

As discussed above, a Web document may contain one or
more hyperlinks. Therefore, the list of properties includes a
list of URLs that are included in hyperlinks within the Web
document. The worker thread 312 passes this list of URLs to
the history map 308. The history map 308 checks each URL
to determine if it is already listed within the history map.
URLs that are not already listed are added to the history map
and are also added to the transaction log 310, to be
subsequently processed by a worker thread.

The worker thread 312 then passes the list of properties
and text to the indexing engine 208. The indexing engine
208 creates an index 210, which is used by the search engine
230 in subsequent searches.

In accordance with the present invention, the Web crawler
also maintains an address map 226 that contains a set of
address prefix pairs. Each address prefix pair contains a
primary address prefix, which forms a portion of a primary
address, and a secondary address prefix. The secondary
address prefix is substituted for the primary address prefix in
a primary address to create a secondary address and obtain
information pertaining to the Web document located at the
primary address. In one actual implementation, the address
map 226 is implemented as a hash table, and primary
address prefixes are hashed to locate entries within the table.
[...] the Web crawler checks the address map
226 to determine if there is an associated secondary address
prefix. If there is a secondary address prefix, the Web crawler
retrieves data pertaining to the Web document by using a
secondary address. The Web crawler may retrieve the Web
document using the primary address, or it may limit the data
retrieval to data obtained using the secondary address. The
data obtained using the secondary address is passed to the
indexing engine 208. The process of using an address map
226 is illustrated in FIG. 5 and discussed in further detail
below.

FIG. 4 illustrates an exemplary address map 226. As
illustrated in FIG. 4, the address map 226 includes a hash
table 322. Each table entry 324 points to a list of
address mappings 328, 330, 332, 334. The use of hash tables
is well known in the art and is not discussed in detail herein,
except as necessary to explain the invention.

An address mapping entry 328 contains a primary address
prefix 336 and a secondary address prefix 338. The primary
address prefix 336 contains a protocol specification and a
top level domain specification. The secondary
address prefix 338 also contains a protocol specification 344
and a top level domain specification 346. Using an address
mapping entry 328, 330, 332, or 334, the worker thread 312
retrieves a primary address prefix from a URL, and finds a
corresponding secondary address prefix 338 in the address
map 320. The worker thread then creates a second, complete
address by replacing the primary address prefix in the
original URL with the secondary address prefix, as discussed
below.

An address prefix can also include a directory
specification. If a primary address prefix 336 includes a directory
specification, the corresponding secondary address prefix
includes a directory specification, which is used to create the
secondary address. An address prefix may further include a
file specification. In such a situation, the corresponding
secondary address prefix specifies the entire secondary
address. As depicted in FIG. 4, the address mapping 334
includes a primary address prefix 350 that comprises a
protocol specification 352, a top level domain specification
354, and a directory specification 356. The corresponding
secondary address specification 358 comprises a protocol
specification 360, a top level domain specification 362, and
a directory specification 364.

The address map 226 can be created and maintained in
several ways. For example, a user can manually enter a list
of primary address prefix and secondary address prefix pairings
each time a new entry is desired. Alternatively, a user can
write a computer program that generates address mappings
between a FILE protocol and an HTTP protocol when the
server 204 (FIG. 2) is a local server.

FIG. 5 illustrates an exemplary process 502 of retrieving
and storing data using the address map of the present
invention. At a step 504, the worker thread 312 (FIG. 3)
retrieves a URL from the transaction log 310. At a step 506,
the worker thread retrieves a Web document using the
protocol and address specified in the URL. In one actual
embodiment, the worker thread 312 passes the URL to the
filter daemon 314, which retrieves the Web document. This
step is performed using the specified file retrieval protocol,
[...]. Step 506 also includes [...] data from the Web
document. The retrieval of data includes parsing the
document, identifying each tag, [...] data. At a step 508, the
[...] address prefix therefore includes a protocol, a top level
domain, an optional directory specification, and a file [...].
In URLs, the characters prior to the first colon typically
specify the protocol. For example, in the URL

http://www.microsoft.com/docs/page1.html

"http" specifies the protocol. The top level domain is
specified by the character string between "://" and the next single
slash. In the above example, "www.microsoft.com" is the
top level domain. The remainder of the URL specifies a
directory path and a file name. In the above example,
"docs/page1.html" specifies the path to a file named
"page1.html."

At a step 510, a determination is made of whether the
address prefix is in the address map 226 as a primary address
prefix 336 (FIG. 4). If it is, at a step 512, the secondary
address prefix 338 corresponding to the primary address
prefix 336 is retrieved. At a step 514, a secondary address is
built by combining the primary address with the secondary
address prefix. When the primary address prefix is the entire
primary address, the secondary address prefix becomes the
secondary address. Although the invention as described
utilizes address prefixes that may comprise a portion of an
address, the mechanism of the invention may be applied to
address mappings where each entry in the address map has
a primary address and a corresponding secondary address.

A primary address can have a plurality of corresponding
addresses that are used to obtain data. For example, an entry
in the address map may include a primary address prefix, a
secondary address prefix, and a tertiary address prefix. The
number of addresses corresponding to a primary address
may vary with each primary address. It should be readily
apparent to one skilled in the art of computer programming,
and others, that the process 502 of retrieving and storing data
can be modified to accommodate multiple and varying
numbers of addresses corresponding to a primary address.

After building a new URL, at a step 516, the worker
thread 312 passes the secondary address to the filter daemon
314, which retrieves additional data using the secondary

5, the secondary address prefix corresponding to the longest
address prefix of a URL is used if there is more than one
primary address prefix for a URL that has an entry in the
address map.

FIG. 5 illustrates a process 502 of obtaining information
pertaining to a single address. As discussed above, a Web
crawler repeats this process for many URLs, as it uses the
links within each Web document to traverse a network of
Web documents.

While the preferred embodiment of the invention has been
illustrated and described, it will be appreciated that various
changes can be made therein without departing from the
spirit and scope of the invention.

The embodiments of the invention in which an exclusive
property or privilege is claimed are defined as follows:

1. A computer-based method of retrieving Web document
information from a computer network, comprising:

retrieving a Web document from a computer network
using a first protocol included in a primary document
address specification;

obtaining data from the Web document;

determining whether the primary document address
specification has a corresponding secondary document
address specification; and

if the primary document address specification has a
corresponding secondary document address specification,
retrieving supplementary data from the computer
network pertaining to the Web document using a second
protocol included in the secondary document address
specification.

2. The method of claim 1, wherein the primary document

address. The secondary data may be a Web document or address specification includes the first protocol and a first 

system level data pertaining to the document. For example, network address, and the secondary document address sped- 

if the secondary address uses the protocol FILE, at step 516, fication includes the second protocol and a second network 
the mechanism of the invention may retrieve such data as the 35 address, and wherein the first protocol is different from the 

time stamp of the last file update or an access control list. An second protocol. 

access control list specifies the set of users that has security 3. The method of claim 2, wherein the first protocol is 

access to the file. HTTP and the second protocol is FILE. 

At step 518, the data retrieved at step 516 is combined 4. The method of claim 3, further comprising retrieving an 
with the data retneved at step 506. At step 520, the combined 40 access control list corresponding to the Web document by 

data is used. For example, as illustrated in FIG. 3, the worker using the secondary document address specification, 

thread may pass the combined data to an indexing engine 5. The method of claim 4, further comprising storing the 

208 for insertion into an index 210, to be subsequently used supplementary data pertaining to the Web document 

by a search engine 230. retrieved using the second protocol from the secondary 

If at the step 510 it is determined that the URL prefix does 45 document address specification with the data retrieved from 

not exist in the address map as a primary address prefix, at the Web document using the first protocol from the primary 

a step 522, the worker thread 312 obtains the next address document address specification in a document index, 

prefix by reducing the address prefix from the right side until 6. The method of claim 1, wherein the primary document 

a slash is found. At a step 524, a determination is made of address specification includes the first protocol and a first 
whether a top level domain specification still remains in the 50 network address, and the secondary document address speci- 

new address prefix. If it does not, flow control proceeds to fication includes the second protocol and a second network 

step 520, where the data retrieved at step 506 is entered into address, and the first network address is different from the 

the index 210, without secondary data. If, at the step 524, a second network address. 

top level domain still remaias in the new address prefix, flow 7. The method of claim 6, wherein retrieving supplemen- 
control returns to the step 510, to search for the new address 55 tary data pertaining to the Web document by using the 

prefix in the address map. secondary document address specification comprises 

After the address prefix is reduced to exclude the file retrieving a second Web document that includes supplemen- 

specification, at the step 514, the retrieved secondary tary data pertaining to the Web document, 

address prefix is combined with the directory and file 8. The method of claim 7, wherein retrieving the second 
specification "below" the address prefix to build a new 60 Web document includes using the hypertext transfer proto- 

secondary URL. For example, if the URL is "http:// col (HTTP) to retrieve the second Web document. 

www.microsoft.com/docs/pagcl.html," the address prefix 9. The method of claim 7, wherein retrieving the second 

being searched at the step 510 is "http:// Web document includes using a database specification to 

www.microsoft.corn" and the secondary address prefix cor- retrieve the second Web document, 
responding to the primary address prefix is "file://msserver," 65 10. The method of claim 1, wherein determining whether 

the new secondary URL is "file://msserver/docs/ the primary document address specification has a corre- 

pagel.html." In the exemplary process 502 depicted in FIG. sponding secondary document address specification 
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includes determining whether an entry corresponding to the 
primary document address specification exists in an address 
map. 

11. The method of claim 10, wherein the entry corre- 
sponding to the primary document address specification 
includes a transfer protocol specification and a top level 
domain specification. 

12. The method of claim 1, further comprising: 
determining whether the primary document address speci- 
fication has a corresponding tertiary document address 
specification; 

if the primary document address specification has a cor- 
responding tertiary document address specification, 
retrieving further data pertaining to the Web document 
by using the tertiary document specification; and 

if the primary document address specification has a cor- 
responding tertiary document address specification, 
storing the further data pertaining to the Web document 
obtained using the tertiary document address specifi- 
cation with the data obtained from the Web document. 

13. The method of claim 1, wherein the secondary docu- 
ment address specification is automatically built by replac- 
ing a secondary address prefix for a primary address prefix 
in the primary document address specification. 

14. The method of claim 13, further comprising: 

(a) obtaining a URL from a transaction log; 

(b) parsing the URL into a URL prefix and URL suffix; 

(c) providing an address map containing a plurality of
primary address prefixes and corresponding secondary
address prefixes;

(d) determining if the URL prefix is included in the 
address map as a primary address prefix; 

(i) if the URL prefix is included in the address map as 
a primary address prefix, combining a secondary 
address prefix that corresponds to the primary 
address prefix with the URL suffix to build the 
secondary document address specification; and 

(ii) if the URL prefix is not included in the address map
as a primary address prefix, changing the parsing of
the URL to incrementally reduce the URL prefix and
increase the URL suffix and then repeating this
paragraph (d).
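
The prefix search recited in steps (b) through (d) of claim 14 can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation described in the specification: the function name `build_secondary_url`, the use of a plain dictionary as the address map, and the rule of stopping once the prefix is shorter than protocol plus host are all hypothetical choices made for the example.

```python
from urllib.parse import urlsplit


def build_secondary_url(url, address_map):
    """Return the secondary URL for `url`, or None if no prefix is mapped.

    `address_map` is assumed to be a dict of primary address prefixes
    (without trailing slashes) to secondary address prefixes.
    """
    parts = urlsplit(url)
    # Shortest prefix worth trying: protocol plus host, e.g. "http://host".
    root = f"{parts.scheme}://{parts.netloc}"
    prefix = url.rstrip("/")
    while len(prefix) >= len(root):
        if prefix in address_map:
            # Everything "below" the matched prefix is kept as the suffix.
            suffix = url[len(prefix):]
            return address_map[prefix] + suffix
        # Reduce the prefix from the right side up to the last slash.
        prefix = prefix[: prefix.rfind("/")]
    return None


address_map = {"http://www.microsoft.com": "file://msserver"}
secondary = build_secondary_url(
    "http://www.microsoft.com/docs/page1.html", address_map
)
```

Because the search starts from the full URL and shortens it one path segment at a time, the first match found is also the longest mapped prefix.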

15. A computer-based method of retrieving information 
from a computer network during a network crawl, compris- 
ing: 

retrieving an electronic document from the computer 
network, the electronic document including at least one 
hyperlink specification including a primary document 
address specification; 

retrieving at least one primary document address speci- 
fication from the electronic document using a first 
protocol included in the primary address specification, 
each primary document address corresponding to a 
linked electronic document; 

determining whether the primary document address speci- 
fication has a corresponding secondary document 
address specification; 

if the primary document address specification has a cor- 
responding secondary document address specification, 
retrieving supplementary data pertaining to the linked 
electronic document from the computer network using 
a second protocol included in the secondary document 
address specification; and 

if the primary document address specification has a cor-
responding secondary document address specification,
storing the supplementary data pertaining to the linked
electronic document obtained using the secondary 
document address specification and associating the 
stored supplementary data pertaining to the linked 
electronic document with the primary document 
address specification. 

16. The method of claim 15, wherein the primary docu- 
ment address specification includes the first protocol and a 
first network address, and the secondary document address 
specification includes the second protocol and a second 
network address, and wherein the first protocol is different 
from the second protocol. 

17. The method of claim 15, wherein the primary docu- 
ment address specification includes the first protocol and a 
first network address, and the secondary document address 
specification includes the second protocol and a second 
network address, and the first network address is different 
from the second network address. 

18. The method of claim 15, wherein determining whether 
the primary document address specification has a corre- 
sponding secondary document address specification 
includes determining whether an entry corresponding to the 
primary document address specification exists in an address 
map. 

19. The method of claim 15, further comprising: 
retrieving data from the linked electronic document using
the primary document address specification;
storing the data retrieved from the linked electronic 
document using the primary document address speci- 
fication; and 

associating the data retrieved from the linked electronic 
document using the primary document address speci- 
fication with the supplementary data pertaining to the 
linked electronic document retrieved using the second- 
ary document address specification. 

20. The method of claim 15, further comprising: 
automatically retrieving a plurality of primary document
address specifications from a plurality of hyperlinks
included in the electronic document;

automatically retrieving a plurality of secondary docu- 
ment address specifications corresponding to said plu- 
rality of primary document address specifications; and 

automatically retrieving supplementary data pertaining to
the linked electronic document using said plurality of
secondary document address specifications.
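
As a rough illustration of claim 20, the following Python sketch collects the hyperlinks in a retrieved document and derives a secondary address for each one. The class `LinkCollector`, the function `secondary_addresses`, the dictionary-based address map, and the example host names are assumptions introduced for this sketch and do not come from the specification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from anchor tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the document's own URL.
                    self.links.append(urljoin(self.base_url, value))


def secondary_addresses(html, base_url, address_map):
    """Map each linked primary address to a secondary address, or None."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    result = {}
    for link in collector.links:
        # A prefix matches only at a path boundary; prefixes are assumed
        # to be stored without a trailing slash.
        matches = [p for p in address_map
                   if link.startswith(p) and link[len(p):len(p) + 1] in ("", "/")]
        if matches:
            best = max(matches, key=len)  # longest primary prefix wins
            result[link] = address_map[best] + link[len(best):]
        else:
            result[link] = None
    return result
```

Links with no entry in the address map are kept with a value of `None`, so a caller can still retrieve them by their primary address alone.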

21. The method of claim 15, wherein the secondary 
address specification is automatically built by replacing a 
secondary address prefix for a primary address prefix in the 
primary document address specification. 

22. The method of claim 21, further comprising: 

(a) obtaining a URL from a transaction log; 

(b) parsing the URL into a URL prefix and URL suffix; 

(c) providing an address map containing a plurality of 
primary address prefixes and corresponding secondary 
address prefixes; 

(d) determining if the URL prefix is included in the 
address map as a primary address prefix; 

(i) if the URL prefix is included in the address map as 
a primary address prefix, combining a secondary 
address prefix that corresponds to the primary 
address prefix with the URL suffix to build the
secondary address specification; and 

(ii) if the URL prefix is not included in the address map
as a primary address prefix, changing the parsing of
the URL to incrementally reduce the URL prefix and
increase the URL suffix and then repeating this
paragraph (d).

23. A system for performing a Web crawl, the system
comprising: 

a server computer having a Web crawler program execut-
ing thereon;

an address map accessible to the Web crawler program 
and containing a plurality of primary Web addresses 
and a plurality of secondary Web addresses, each
primary Web address having a corresponding second-
ary Web address;

the primary Web address including a first protocol for the 
retrieval of a Web document at the primary Web 
address;

the secondary Web address including a second protocol 
for the retrieval of a Web document at the secondary 
Web address; 

the second protocol for the retrieval of a Web document 
at the secondary Web address being different than the
first protocol for the retrieval of a Web document at the 
primary Web address; 

a computer network including at least one Web server 
having a plurality of Web documents stored thereon, 
each Web document having a corresponding primary
Web address; 

a database containing information pertaining to the plu-
rality of Web documents;

program code for:
retrieving a primary Web address corresponding to one
of the Web documents;

determining whether the primary Web address has a
corresponding secondary Web address; 

selectively retrieving supplementary information per- 
taining to said one of the Web documents using the 
corresponding secondary Web address; and 

if the supplementary information pertaining to said one 
of the Web documents is retrieved using the second- 
ary Web address, storing said supplementary infor- 
mation in the database. 

24. The system of claim 23, further comprising a search 
engine that performs a Web search using the database. 

25. The system of claim 24, wherein the first protocol is 
HTTP, the second protocol is FILE, and the retrieving of 
supplementary information pertaining to said one of the Web 
documents using the secondary Web address includes using 
file system commands to retrieve the supplementary infor- 
mation pertaining to said one of the Web documents. 

26. The system of claim 25, wherein the supplementary 
information pertaining to said one of the Web documents 
includes an access control list. 

27. The system of claim 23, wherein each primary Web 
address includes a data transfer protocol specification and a 
top level domain specification, and each secondary Web
address includes a data transfer protocol specification and a 
top level domain specification. 

28. The system of claim 23, wherein the address map and 
said plurality of Web documents reside on different com- 
puters. 
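
The retrieval of supplementary information through the FILE protocol (claims 3, 5 and 25) can be sketched in Python as follows. This is a simplified illustration under several assumptions: the helper names `fetch_file_metadata` and `combine` are hypothetical, the file is assumed to be reachable as a local path, and only the modification time stamp and size are read, because reading an access control list depends on platform-specific interfaces that are not shown here.

```python
import os
from urllib.parse import urlsplit
from urllib.request import url2pathname


def fetch_file_metadata(secondary_url):
    """Return system-level data for a file:// URL (hypothetical helper)."""
    parts = urlsplit(secondary_url)
    if parts.scheme != "file":
        raise ValueError("this sketch only handles the FILE protocol")
    # Assumption: the path is local; a host such as "msserver" would
    # need a network share or mount in practice.
    info = os.stat(url2pathname(parts.path))
    return {"last_modified": info.st_mtime, "size_bytes": info.st_size}


def combine(primary_data, supplementary):
    """Merge data from the Web document with the supplementary data."""
    record = dict(primary_data)
    # None is stored when the address map held no secondary address.
    record["supplementary"] = supplementary
    return record
```

The combined record stands in for the entry that would be passed on for storage in the database or index.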

*****